Abstract: Scene classification is a fundamental perception
task for environmental understanding in today’s robotics. In
this paper, we exploit deep learning to enhance scene
understanding, particularly in robotics applications. As scene
images have larger diversity than iconic object images, it is
more challenging for deep learning methods to automatically
learn features from scene images with fewer samples. Inspired
by human scene understanding based on object knowledge,
we address the problem of scene classification by encouraging
deep neural networks to incorporate object-level information.
This is implemented with a regularization of semantic
segmentation. With only 5 thousand training images, as opposed to
2.5 million images, we show that the proposed deep architecture
achieves scene classification results superior to the
state-of-the-art on the publicly available SUN RGB-D dataset.
In addition, the performance of semantic segmentation, the
regularizer, also reaches a new record with refinement derived
from predicted scene labels. Finally, we apply our model trained
on the SUN RGB-D dataset to images captured by a mobile robot
to classify scenes in our university, demonstrating the
generalization ability of the proposed algorithm.


I. Introduction

Today’s robotics faces many perception challenges such
as scene classification (Figure 1), semantic segmentation,
object recognition and detection. For object-level tasks, a
series of new performance standards have been set by the recently
successful deep Convolutional Neural Networks (CNN) [1]-[3],
while the performance of deep CNNs on scene-level perception
did not reach the same level of success before
the work of Place-CNN [4]. As pointed out in [4], the scene-level
task is more challenging for feature learning due to
the larger diversity of scene images compared to iconic
object images. Place-CNN overcame this diversity
and reached the state-of-the-art by training on 2.5 million scene
images. However, it is very expensive to collect and label
training images at such a large scale. Furthermore, enhancing
the performance by increasing the number of training samples
is not preferable in most robotic applications, especially
for tasks with insufficient samples. In this paper, we
focus on constructing a scene classifier with competitive
performance that automatically learns features from a smaller
number of training images using a deep CNN.

Fig. 1. Scene classification demonstration. The examples are captured in
our university using the mobile robot. Our SS-CNN trained on the SUN RGB-D
dataset gives the predicted labels below each image, without retraining for the
completely new environment. Labels in black mean correct classification.
Two misclassified images are given with labels in red, while the predicted
labels are in accord with human recognition to some extent.

It is likely that human beings understand
scene classes mainly according to object-level information,
as scene classes are naturally defined at a higher level
than the objects. For example, we are inclined to recognize a
scene as “bedroom” when we find the objects “bed” and “night
stand” in it. Intuitively, understanding scene classes
through object-level information would suppress the large
diversity of scene images and lead to better generalization
ability. This hypothesis is validated with a preliminary
experiment that uses object existence as the feature vector to classify
the scene classes: with a much lower dimension, the object
existence feature achieves performance similar to the Place-CNN
features. This result reveals that object-level information
has the potential to improve scene classification. Inspired
by the human way of scene classification, we encourage the
deep CNN to understand objects at an early stage. Specifically,
we develop a scene classification model with regularization
of semantic segmentation based on the well-known CNN
architecture Alexnet [1], named SS-CNN. An example of
our model structure is shown in Figure 2, where the features
learned for scene classification in SS-CNN automatically
involve object-level information. On the SUN RGB-D dataset [5],
we train our SS-CNN and show that it significantly outperforms
the original Alexnet, which further validates our hypothesis
that the semantic regularization enhances the generalization
ability. Besides, SS-CNN achieves superior results compared
to the state-of-the-art Place-CNN, which is also based on
Alexnet but gains its power from 2.5 million training images,
while SS-CNN is trained with only 5 thousand images.

Fig. 2. An example architecture of the proposed SS-CNN, which is composed of a main branch for scene classification and a regularizer branch for
semantic segmentation. The semantic regularization is imposed on the first 4 layers in this figure. The main branch outputs a 1-D probability prediction
for each image, while the regularization branch outputs a 2-D probability prediction for each pixel. A refinement process, denoted by the dashed orange
line, is applied at test time to improve the performance of the semantic segmentation using the predicted scene labels.

In addition, we develop a refinement method for semantic
segmentation with the predicted scene classes, which is
based on scene-object co-occurrences learned from training
data. For instance, knowing that the scene is a “bedroom” could
prevent us from misclassifying the object “cabinet” as a
“fridge”. As a result, the performance of semantic segmentation
also reaches the state-of-the-art on the SUN RGB-D dataset.

After training and validation on the SUN RGB-D dataset
collected in the US, we further apply our SS-CNN to a
mobile robot to classify scenes in a building of our university
in Australia. The mobile robot and some RGB images with
their predicted results are given in Figure 1. The promising
performance of SS-CNN in the completely new environment
reveals its potential in robotics applications.

The remainder of the paper is organized as follows:
Section II gives a review of related works and Section III
gives the preliminary experiment to validate our hypothesis.
The proposed model and refinement method are given in
Section IV and the experimental results are shown in
Section V. Finally, we conclude the paper with future directions
of research in Section VI.

II. Related works 

Some previous works have demonstrated that scene and object
understanding have the capability to promote
each other [6]-[9]. The typical idea is to build the
relationship between scenes and objects using a graphical
model such as a Markov Random Field or Conditional Random
Field [10]. Though these works have achieved superior
results, they are based on hand-crafted features, which means
the feature extraction and classification in these works are
not in a unified optimization framework. Compared to these
works focusing on simultaneous labeling, our work is
closer to Object Bank [11], since we focus on regularizing
scene classification with semantic segmentation. Object Bank
proposed a high-level representation for scene classification
by encoding the images with a combination of a large number
of object detectors. However, the feature extraction and scene
classification in Object Bank are still optimized separately,
and it requires pre-training a large number of object
detectors. Recently, the superior results achieved with deep
learning methods suggest that learning features with a fully
trainable architecture may be a better choice. In this paper,
we implement the scene classifier using a fully trainable
deep architecture with a single semantic segmentation branch
encoding all object-level information.

As for conventional deep learning methods, the most
successful CNN model in scene classification is Place-CNN [4],
which is trained on 2.5 million labeled images
belonging to 476 scene classes using the well-known
architecture Alexnet [1]. Before [4], the performance of CNNs on
scene classification was within the range of performances
of some implementations based on hand-crafted features. As
pointed out in [4], one reason for the relatively poor results
of CNNs on scene classification is the larger diversity
of scene-centric images compared to object-centric images,
which means scene classification places higher requirements on
generalization ability. By encouraging the CNN to classify the
scene through implicit understanding of object existence,
we develop a scene classifier that is also based on Alexnet and
achieves better generalization ability than Place-CNN with
only 5 thousand training images.

To the best of our knowledge, considering multiple tasks is
rarely exploited in deep learning methods. An exception is
the refinement in DeepID-Net [12], which took the image
classification result to refine the object detection. More
specifically, they introduced a separate network for
image classification and concatenated the estimated image
probability with the estimated object probability for a further
classification, which means the information of the two tasks
is only combined after independent training, instead of
through simultaneous training as implemented in this paper.



























TABLE I

Scene classification accuracy based on object occurrence
with comparison to benchmark.

Method                              Acc (%)
GIST + RBF Kernel SVM [5]           19.7
Place-CNN + Linear SVM [5]          35.6
Place-CNN + RBF Kernel SVM [5]      38.1
Object occurrence + Linear SVM      33.1


III. Exploratory study 

To explore the role that object-level information plays in
scene classification, we first conduct a preliminary
experiment to classify a scene with knowledge of object occurrences.
We assume the ground truth of object occurrences is known
in every image, so each image can be represented by a
binary encoded vector of length $M_o$, where 1 denotes that the
object is contained in the image and 0 otherwise, and $M_o$ is
the number of object classes in the whole dataset.

We test this approach on the SUN RGB-D dataset [5],
a recently published indoor dataset with dense annotations
for both objects and scenes. Here the coarse annotation
of semantic segmentation is adopted, where $M_o = 37$.
We follow the same split configuration as in [5] to test
the performance, and Table I shows the scene classification
results based on different features. A brief introduction of
the compared methods is given in Section V. Here both GIST
and Place-CNN features are extracted from the RGB image.

Experimental results show that object occurrences
significantly outperform the hand-crafted feature GIST, and
reach a level similar to Place-CNN. It is to be noted
that the dimension of the object occurrence feature is much lower
than that of both the GIST and Place-CNN features. This experiment
reveals that knowledge at the object level has the potential
to improve the performance of scene classification, which
inspires us to consider learning scene classification together
with object-level information.
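
As an illustration of the binary occurrence encoding used in this exploratory study, the following is a minimal sketch and not the authors' code. It assumes a hypothetical label convention in which the ground-truth segmentation is an integer map with values 1..37 for the object classes and 0 for unlabeled pixels; the function name `occurrence_vector` is likewise illustrative.

```python
def occurrence_vector(label_map, num_classes=37):
    """Return a binary list of length num_classes.

    Entry k-1 is 1 if object class k appears anywhere in label_map.
    Assumption (not from the paper): labels are integers in 1..num_classes
    and 0 marks unlabeled pixels, which are skipped.
    """
    vec = [0] * num_classes
    for row in label_map:
        for label in row:
            if 1 <= label <= num_classes:
                vec[label - 1] = 1
    return vec


# Toy 2 x 3 label map containing classes 1, 4 and 37 plus one unlabeled pixel.
toy_map = [[1, 1, 4],
           [0, 4, 37]]
vec = occurrence_vector(toy_map)
```

The resulting 37-dimensional vector is the kind of feature that would be passed to a linear SVM in the experiment above; the classifier itself is omitted here.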

IV. Model design 

With the aim of learning scene features that involve object
information, we construct our SS-CNN for scene classification
with regularization of semantic segmentation. In
this section, the network architecture of our SS-CNN is
introduced in detail, followed by the model learning and
input construction. On top of that, we implement refinement
for semantic segmentation with the predicted scene labels.

A. CNN for scene classification with semantic segmentation 
regularization 

Notation. We first clarify the symbols used in this paper.
Assume there are $M_s$ scene classes in scene classification
and $M_o$ object classes in semantic segmentation. Let’s denote
the data structure of a single sample as $\{X, y_s, Y_o\}$, where
$X \in \mathbb{R}^{H \times W \times C}$ is the input image with H as height, W as width
and C as number of channels, and $y_s \in \mathbb{R}^{M_s}$ is the ground
truth of the scene label encoded in the 1-of-K encoding scheme,
i.e. $y_s^k = 1$ if X belongs to the k-th scene class, otherwise
$y_s^k = 0$. $Y_o \in \mathbb{R}^{H \times W \times M_o}$ is the ground truth of the semantic
segmentation label, having the same height and width as
X. Analogously, $y_o^{ijk} = 1$ denotes that the pixel (i, j) belongs
to the k-th object class.

Network architecture. For model construction of SS-CNN,
a conventional CNN model is employed as the basic
model to predict the scene classes with the input pair $\{X, y_s\}$.
Then the major contribution of this paper is to add another
fully convolutional branch [13] to the basic model, with the
aim of estimating $Y_o$ for semantic segmentation. The fully
convolutional branch can be attached to the main branch at an
arbitrary layer, so we define SS-CNN-Rn to denote the
different configurations of SS-CNN as follows:

Given an original CNN for scene classification with $N_l$
layers in all, denote SS-CNN-Rn as the SS-CNN with the
first n layers regularized by semantic segmentation, where n
ranges from 1 to $N_l$.

In this paper, we take the well-known Alexnet
architecture [1] as our main branch for scene classification. In
Alexnet, we have $N_l = 8$ and there are 8 variants of SS-CNN.
The detailed network configurations of some typical
networks are given in Figure 3.

Intuitively, the number of layers regularized by semantic
segmentation influences the performance of SS-CNN.
If n is small, the regularization is only imposed on a few
layers of the scene classification branch. In the extreme
case of n = 0, two separate neural networks are
constructed for scene classification and semantic segmentation
respectively. As n gets larger, the semantic segmentation
regularizes more layers in the main branch.

It is to be noted that the main branch keeps its original
structure from SS-CNN-R1 to SS-CNN-R5, with 5
convolutional layers and 3 fully connected layers. Beginning from
SS-CNN-R6, the structure of the main branch is slightly
different, as the fully connected layers in the main branch are
also cast into convolutional layers one by one. When
n = 8, fc6 and fc7 are both cast into convolutional
layers, thus two additional fully connected layers are built for
scene classification.

Model learning. As can be seen from the SS-CNN
architectures, the loss function of our SS-CNN is composed
of two parts: one is the loss of scene classification and the
other is the loss of semantic segmentation. In this paper, we use
the multinomial logistic loss on a softmax layer. The loss
function of scene classification is:

$$L_{scene} = -\sum_{k=1}^{M_s} y_s^k \log(p_s^k)$$

where $p_s^k$ is the probability of estimating X in class k, which
is obtained with the final softmax layer taking f as input:

$$p_s^k = \frac{\exp(f_k)}{\sum_{k'=1}^{M_s} \exp(f_{k'})}$$
Analogously, we can obtain the probability of each pixel
in the semantic segmentation branch and define the loss function as:

$$L_{object} = -\sum_{i} \sum_{j} \sum_{k=1}^{M_o} y_o^{ijk} \log(p_o^{ijk})$$

Fig. 3. Examples of SS-CNN-Rn with n = 6, 8: (a) SS-CNN-R6, (b) SS-CNN-R8. Note that the structure in Figure 2 is SS-CNN-R4. The main branch in SS-CNN-R4 has the same
structure as Alexnet, with 5 convolutional layers and 3 fully connected layers, while SS-CNN-R6 has 6 convolutional layers and 2 fully connected
layers. SS-CNN-R8 is more special, with 8 convolutional layers and 2 additional fully connected layers in the main branch.

Then the loss of the whole network is composed of these
two losses as:

$$L_{ss} = L_{scene} + \alpha L_{object}$$

where $\alpha$ is the weight of the regularization term $L_{object}$.
Notice that each image corresponds to a single cost for
scene classification, while the cost of semantic segmentation
is summed over all pixels (not normalized in our model);
thus we take a fixed weight $\alpha = 1/1000$ to balance these two
costs in this paper.
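
To make the combined objective concrete, the following is a minimal sketch of the loss computation for one sample, written in plain Python rather than in the framework used for the experiments. The function names and the nested-list data layout are illustrative assumptions; only the formulas (softmax, multinomial logistic loss, unnormalized sum over pixels, and the weight of 1/1000) come from the text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, probs):
    """Multinomial logistic loss; terms with a zero target are skipped."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y)


def ss_loss(scene_scores, scene_onehot, pixel_scores, pixel_onehot,
            alpha=1.0 / 1000):
    """Return (L_ss, L_scene, L_object) for a single sample.

    pixel_scores and pixel_onehot are H x W nested lists of per-class lists
    (an assumed layout). L_object is summed over pixels, not averaged.
    """
    l_scene = cross_entropy(scene_onehot, softmax(scene_scores))
    l_object = 0.0
    for row_scores, row_targets in zip(pixel_scores, pixel_onehot):
        for scores, target in zip(row_scores, row_targets):
            l_object += cross_entropy(target, softmax(scores))
    return l_scene + alpha * l_object, l_scene, l_object


# Toy sample: 2 scene classes, a 2 x 2 image, 3 object classes, all scores 0.
toy_pixel_scores = [[[0.0, 0.0, 0.0]] * 2 for _ in range(2)]
toy_pixel_onehot = [[[0, 1, 0]] * 2 for _ in range(2)]
total, l_scene, l_object = ss_loss([0.0, 0.0], [1, 0],
                                   toy_pixel_scores, toy_pixel_onehot)
```

Because the segmentation term grows with the number of pixels while the scene term is a single value per image, a small weight keeps the two terms on a comparable scale, which is the motivation given above for the fixed weight.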

We use stochastic gradient descent with momentum for
model training. Note that given SS-CNN-Rn, only the weights
from layer 1 to layer n are regularized with semantic
segmentation, i.e. tuned with respect to the partial gradient
of $L_{ss}$. From layer n + 1 onward, the weights in the scene
classification branch are tuned only with respect to $L_{scene}$, and
likewise the weights in the semantic segmentation branch are
tuned only with respect to $\alpha L_{object}$.

Depth representation. Depth information is important in
scene understanding. Many successful models are built on
RGB-D inputs captured by affordable RGB-D sensors
such as Kinect and X-tion, especially in indoor
environments [14], [15]. In this paper, we also explore effective
ways to encode the depth information in a deep CNN.

The most direct way of considering depth information in a
deep CNN is to add a depth channel in the input layer. The
depth image we use is linearly rescaled to [0, 255], which is
the same range as the RGB image. Since the depth image
only provides information about distances, we also consider
using the knowledge of normal vectors. To estimate the
normal vectors, a bilateral filter is first applied to the depth
image for smoothing. The smoothed depth image
is then transformed into a point cloud with the camera intrinsic
parameters, on which the normal vectors are estimated. The normal
vectors are also rescaled to [0, 255] and represented as an image
with 3 channels. A visualization of an RGB image and its
corresponding depth image and normal vector image is given in
Figure 4.

Fig. 4. An example of the RGB image in SUN RGB-D dataset, with its
corresponding depth image and normal vector image.

In this paper, we encode the depth representation as a 
combination of depth image and normal vector image, and 
then the RGB-D input has 7 channels for each image as 
shown in Figure 
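
The 7-channel input can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' preprocessing: the depth rescaling is assumed to be a per-image min-max mapping, unit normal components in [-1, 1] are assumed to be mapped linearly to [0, 255], and the bilateral filtering and normal estimation steps are omitted. All function names are hypothetical.

```python
def rescale_to_255(values):
    """Linearly map a list of depth values to [0, 255].

    Assumption: min-max rescaling per image; the paper only states that the
    rescaling is linear.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [255.0 * (v - lo) / (hi - lo) for v in values]


def normal_to_255(normal):
    """Map unit-normal components from [-1, 1] to [0, 255] (assumed mapping)."""
    return [255.0 * (c + 1.0) / 2.0 for c in normal]


def stack_pixel(rgb, depth, normal):
    """Concatenate RGB (3), depth (1) and normal (3) into 7 channels."""
    return list(rgb) + [depth] + list(normal)


depths = rescale_to_255([0.5, 1.5, 2.5])        # toy depth values in meters
normal = normal_to_255((0.0, 0.0, 1.0))         # normal facing the camera axis
pixel = stack_pixel((10, 20, 30), depths[1], normal)
```

Each pixel then carries three color channels, one depth channel and three normal channels, matching the 7-channel input described above.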

B. Refinement of semantic segmentation with scene classification

Intuitively, scene classes can provide prior information
about object occurrences. This idea can be used to further
refine the semantic segmentation. For
example, if a robot correctly recognizes an environment as
a bedroom, then it is fair to expect a bed in the image
rather than a shower curtain. Based on the architecture of
SS-CNN, we can conveniently incorporate the estimated
scene probability to refine the semantic
segmentation.

As described in Section IV-A, the softmax layer generates the
estimated probability of scene classification. Let’s denote
$P_s = [p_s^1, \cdots, p_s^{M_s}] \in \mathbb{R}^{M_s}$ as the probability vector.
Similarly, the probability of semantic segmentation is denoted
as $P_o \in \mathbb{R}^{H \times W \times M_o}$. Then the refinement process can be
represented as follows:

$$P_{so} = P_s W_{so} \quad (2)$$

$$\hat{P}_o = P_{so} \otimes P_o \quad (3)$$

where $W_{so} \in \mathbb{R}^{M_s \times M_o}$ is the refinement matrix learned
from training data and $P_{so}$ represents the prior probability
of objects derived from the estimated scene classes, which is
propagated to $P_o$ through multiplication with broadcasting
(denoted as $\otimes$ in this paper), i.e. the $M_o$ values
in $P_{so}$ are broadcast to the corresponding score maps in $P_o$
to give the refined prediction $\hat{P}_o$. The refinement
process is illustrated in Figure 5.
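
The two refinement steps, Eq. (2) and Eq. (3), can be sketched in a few lines. The code below is an illustrative sketch with made-up numbers: the refinement matrix, the two scene classes ("bedroom", "kitchen") and the two object classes ("cabinet", "fridge") are toy assumptions chosen to mirror the example in the introduction, and the nested-list layout is not the authors' implementation.

```python
def refine(p_s, w_so, p_o):
    """Apply Eq. (2) and Eq. (3).

    p_s  : list of M_s scene probabilities
    w_so : M_s x M_o refinement matrix (nested lists)
    p_o  : H x W x M_o per-pixel object probabilities (nested lists)
    Returns (p_so, refined) where refined has the same shape as p_o.
    """
    m_o = len(w_so[0])
    # Eq. (2): object prior derived from the scene probabilities.
    p_so = [sum(p_s[i] * w_so[i][k] for i in range(len(p_s)))
            for k in range(m_o)]
    # Eq. (3): broadcast the M_o prior values over every pixel.
    refined = [[[p_so[k] * px[k] for k in range(m_o)] for px in row]
               for row in p_o]
    return p_so, refined


def argmax_labels(score_maps):
    """Per-pixel index of the highest-scoring object class."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row]
            for row in score_maps]


# Toy example: scenes = [bedroom, kitchen], objects = [cabinet, fridge].
p_s = [0.9, 0.1]                      # the scene is most likely a bedroom
w_so = [[1.0, 0.1],                   # hypothetical refinement weights
        [0.5, 1.0]]
p_o = [[[0.45, 0.55]]]                # one pixel, slightly favoring "fridge"

p_so, refined = refine(p_s, w_so, p_o)
before = argmax_labels(p_o)
after = argmax_labels(refined)
```

In this toy case the pixel label flips from "fridge" to "cabinet" once the bedroom prior is applied. Whether the refined scores are renormalized is not stated in the text; the per-pixel argmax does not depend on it.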

The refinement matrix $W_{so}$ is constructed based
on the scene-object co-occurrence distribution in the training
dataset. Rather than directly deciding the refinement matrix
from the object frequency, we propose to construct $W_{so}$ in
a way similar to term frequency-inverse document frequency
(tf-idf). Inspired by the inverse document frequency term in
tf-idf, how important an object is in a scene is also
considered. For example, the object classes “wall” and “floor”
are the most common ones and appear in almost every scene.
These common classes are usually ignored. When we want
the robot to finish a certain task such as “find the bowl in the
kitchen”, these common classes are less meaningful in the
context of semantic segmentation, while the training process
actually pays more attention to these classes because of their
large number of training samples.

Let’s first construct the original term frequency matrix
$f \in \mathbb{R}^{M_s \times M_o}$, where $f_{ij}$ denotes the number of times object j occurs in
scene i. The term frequency is then normalized as:

$$tf_{ij} = \log(1 + f_{ij})$$

In this paper the inverse document frequency is
constructed as:


Finally, the entry $W_{ij}$ of the weight matrix $W_{so}$ is constructed by
the multiplication of these two terms with normalization as:

$$W_{ij} = \log(1 + tf_{ij} \times idf_j)$$

where $i = 1, \cdots, M_s$ and $j = 1, \cdots, M_o$. If $W_{ij} = 0$, we set
$W_{ij}$ to a nonzero value, in case the training dataset does not exactly
represent the scene-object occurrences of the test dataset.
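
The construction of the refinement matrix can be sketched as below. Two parts of this sketch are assumptions and not taken from the paper: the inverse document frequency uses the textbook form log(M_s / df_j), where df_j is the number of scene classes in which object j occurs, and the replacement value for zero entries is a hypothetical constant named `floor`. The term frequency and the final log(1 + tf x idf) step follow the formulas above.

```python
import math


def refinement_matrix(counts, floor=1e-3):
    """Build an M_s x M_o tf-idf style weight matrix from co-occurrence counts.

    counts[i][j] is how often object j occurs in scene i.
    ASSUMPTION: idf_j = log(M_s / df_j), the textbook form.
    ASSUMPTION: zero weights are replaced by `floor` (hypothetical value).
    """
    m_s, m_o = len(counts), len(counts[0])
    # df_j: number of scene classes in which object j appears at least once.
    df = [sum(1 for i in range(m_s) if counts[i][j] > 0) for j in range(m_o)]
    idf = [math.log(m_s / d) if d > 0 else 0.0 for d in df]
    weights = []
    for i in range(m_s):
        row = []
        for j in range(m_o):
            tf = math.log(1 + counts[i][j])          # normalized term frequency
            w = math.log(1 + tf * idf[j])            # W_ij as defined above
            row.append(w if w > 0 else floor)
        weights.append(row)
    return weights


# Toy counts: scenes = [bedroom, kitchen], objects = [wall, bed, fridge].
toy_counts = [[50, 30, 0],
              [40, 0, 20]]
w_toy = refinement_matrix(toy_counts)
```

With this assumed idf, an object such as "wall" that occurs in every scene class receives the floor weight, while scene-specific objects such as "bed" in a bedroom receive larger weights, which matches the intent of down-weighting common classes.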

V. Experiments and results 

We first train and validate our SS-CNN on the SUN RBG- 
D dataset [5], which is an indoor dataset with 10335 RGB-D 
images in all. In [5], the benchmark of scene classification 



Po 


Fig. 5. Illustration of refinement process. 

TABLE II 

Summary of SUN RGB-D dataset. 

Task Dataset #Train #Test #A11 

Scene classification Sig 4845 4659 9504 

Semantic segmentation S 45 5285 5050 10335 


is conducted on a subset of the dataset, which is composed 
of 19 scene classes with more than 80 samples, while the
benchmark of semantic segmentation is conducted on the
whole dataset with 45 scene classes. A summary of the
datasets used in different tasks is given in Table II, where
the split configuration is provided in the toolbox of the SUN
RGB-D dataset (see footnote 1). To make a fair comparison, we also validate
the scene classification performance on $S_{19}$ and validate the
semantic segmentation performance on $S_{45}$. For both cases,
SS-CNN is trained with only the training images in the SUN
RGB-D dataset, without other data augmentation.

On top of the model trained and validated on SUN RGB- 
D dataset, we experimentally test the performance of SS- 
CNN on a set of test images collected in a building of our 
university using a mobile robot. 

A. Experimental setup 

During the training process, we resize both the input
images and the semantic segmentation ground truth to 210 × 158
for computational efficiency. Let’s denote the resized image
datasets as $\tilde{S}_{19}$ and $\tilde{S}_{45}$ respectively.

To predict the pixel-wise labels in the semantic
segmentation branch, we construct our SS-CNN based on a slightly
modified Alexnet. The receptive field of the original Alexnet
is 224 × 224, with pixel stride 32. Intuitively, a large stride
leads to coarse semantic segmentation results. A smaller stride
is required for semantic segmentation in our work
since the image size we use is 210 × 158. In [13], the authors
implemented a fusion technique named “deep jet” for finer
segmentation results. Instead of fusing results from multiple
layers as in “deep jet”, we choose to slightly modify
the configuration of Alexnet to directly get a network with
stride 16 and receptive field 81 × 81. The rationale is that this
paper focuses on validating the effectiveness of semantic
regularization on scene classification rather than obtaining
a finer semantic segmentation.

Footnote 1: http://rgbd.cs.princeton.edu.

Fig. 6. Comparison of SS-CNN-Rn on scene classification with n =
2, 4, 6, 8. The performance of the original Alexnet is given as a baseline.

Besides, the length of fc7
is reduced to 512 while the original length in Alexnet is
4096, which is also illustrated in Figure We believe the 
performance of SS-CNN can be further improved with higher 
resolution of training images and more parameters in fc7. 

Our network is implemented on Caffe [16], a popular deep 
learning framework. For model learning, we use stochastic 
gradient descent with momentum to train the randomly 
initialized network, and the size of each minibatch is 20. 
The learning rate is fixed as 10“^ during the training process, 
and the momentum is fixed as 0.9. Similar to the common 
configuration in training deep neural networks, we use a 
weight decay of 5“^, and double the learning rate of biases. 
We also employed dropout in the fully connected layers. We 
are planning to release our model in the near future. 

B. Evaluation of semantic regularization 

To evaluate the effectiveness of our semantic
regularization, we first make a comparison between our SS-CNN-Rn
and the basic Alexnet, where layer 1 to layer n in
SS-CNN-Rn are regularized by the semantic segmentation cost
as introduced in Section IV-A. Both SS-CNN-Rn and the
original Alexnet are trained with the same training data in
the resized dataset $\tilde{S}_{19}$, and only RGB images are considered in this evaluation
experiment.

The models we compare include SS-CNN-R2, R4, R6
and R8. The comparison result is shown in Figure 6, which
demonstrates that SS-CNN-Rn considerably outperforms the
original Alexnet for each n. It reveals that the generalization
ability on scene classification is significantly improved with
the regularization of semantic segmentation.

By analyzing the influence of n in SS-CNN-Rn, we
can gain further insight into how the generalization ability
is enhanced by the regularization. Figure 6 shows that the
performance of SS-CNN-Rn improves slightly
as n increases from 2 to 6. This can be explained as follows: when
n is small, the regularization is added only to the early layers
of the main branch, which means the low-level features are
regularized. As n increases, the features being regularized
become more abstract, and even object-level features would
start to emerge in higher layers with the regularization of
semantic segmentation. However, it can be seen that the
performance of SS-CNN-R8 has an apparent drop. One
possible reason is that SS-CNN-R8 directly classifies the
scene based on semantic segmentation results, in which case the
performance of scene classification would suffer from the
misclassification of semantic segmentation. For better
illustration, the confusion matrices of our best model SS-CNN-R6
and Alexnet are given in Figure 7, which demonstrates that the
scene classification result is considerably improved with the
semantic regularization.

Fig. 7. Confusion matrices of (a) Alexnet and (b) SS-CNN-R6; both are trained
with only training images in SUN RGB-D dataset.

C. Validation on scene classification 

In this section, we make a comparison between our SS-CNN
and the benchmark methods of scene classification
on $S_{19}$ [5], with 4845 training samples and 4659 validation
samples as shown in Table II. It is to be noted that our SS-CNN
is trained and evaluated on the resized dataset $\tilde{S}_{19}$, which does
not harm the fairness in the task of scene classification. The
compared methods are introduced as follows:

• GIST [17] + SVM. GIST is a well-known descriptor for
modeling a scene image, which summarizes the gradient
information of a given image. An RBF kernel SVM is
employed for classification.

• Place-CNN [4] + SVM. As introduced in Section II,
Place-CNN is pre-trained on 2.5 million scene images
using Alexnet. Because of its pre-trained structure,
Place-CNN is usually employed as a feature extraction
method in scene classification applications and an
additional classifier is required for classification. In [5],
both Linear SVM and RBF Kernel SVM are considered
to train and classify the Place-CNN features extracted
from $S_{19}$, and the latter achieves the state-of-the-art
performance.

• Alexnet. As both Place-CNN and SS-CNN are based
on Alexnet, the performance of the original Alexnet
is also evaluated as a baseline. The Alexnet we
implement is trained with only training images in $S_{19}$
from randomly initialized weights. Unlike the separate
feature extraction and classification required in the
Place-CNN model, the Alexnet we trained directly classifies
the scene classes using the softmax classifier within the
network architecture.

• SS-CNN. As suggested in Figure 6, SS-CNN-R6 is
the best configuration and thus is employed in this
comparison. Our SS-CNN-R6 is also trained with only
training images in $S_{19}$ and the softmax classifier is
employed for classification.













TABLE III

Scene classification comparison.

Model                              Input     Acc (%)
GIST + RBF Kernel SVM [5]          RGB       19.7
GIST + RBF Kernel SVM [5]          RGB + D   23.0
Place-CNN + Linear SVM [5]         RGB       35.6
Place-CNN + Linear SVM [5]         RGB + D   37.2
Place-CNN + RBF Kernel SVM [5]     RGB       38.1
Place-CNN + RBF Kernel SVM [5]     RGB + D   39.0
Alexnet                            RGB       24.3
Alexnet                            RGB + D   30.7
SS-CNN-R6                          RGB       36.1
SS-CNN-R6                          RGB + D   41.3


TABLE IV

Semantic segmentation comparison.

Dataset           Model           Input             Acc (%)
S_45 (resized)    SS-CNN-R6       RGB               27.77
S_45 (resized)    SS-CNN-R6       RGB + D           37.03
S_45 (resized)    SS-CNN-R6       RGB + D refined   41.76
S_45              NN [5]          RGB + D           8.97
S_45              SIFT Flow [5]   RGB + D           10.05
S_45              KDES [5]        RGB + D           36.33
S_45              SS-CNN-R6       RGB + D refined   40.66
TABLE V 



Experimental validation results. 


Comparison results are given in Table III, where the accuracy
is calculated as the mean accuracy of the 19 scene classes.
Table III also shows the results learned from both RGB input
and RGB-D input. For the depth information, [5] adopts the
HHA [15] representation in the GIST feature and the Place-CNN
feature. HHA is composed of horizontal disparity, height
above ground, and the angle information. As HHA requires
inferring the ground and the gravity direction, our depth
representation in Section IV-A is a more compact choice that
is also effective, as shown in Table III.
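
Since the reported figures are mean per-class accuracies rather than overall accuracies, the following short sketch shows the difference between the two. It is an illustration with toy labels only; the function name and data are hypothetical.

```python
def mean_class_accuracy(true_labels, pred_labels):
    """Average of per-class accuracies, so every class counts equally.

    Returns (mean_accuracy, per_class_accuracy_dict).
    """
    correct, total = {}, {}
    for t, p in zip(true_labels, pred_labels):
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (1 if t == p else 0)
    per_class = {c: correct[c] / total[c] for c in total}
    return sum(per_class.values()) / len(per_class), per_class


# Toy example: 4 "office" images and 2 "kitchen" images.
truth = ["office"] * 4 + ["kitchen"] * 2
preds = ["office", "office", "office", "kitchen", "kitchen", "office"]
mean_acc, per_class = mean_class_accuracy(truth, preds)
overall_acc = sum(t == p for t, p in zip(truth, preds)) / len(truth)
```

Here the per-class accuracies are 0.75 and 0.5, giving a mean class accuracy of 0.625, whereas the overall accuracy is 4/6. The mean class accuracy is less affected by classes with many samples.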


It can be seen that Place-CNN gains a considerable
improvement from pre-training on 2.5 million scene images
compared to the original Alexnet trained with only the SUN
RGB-D dataset. SS-CNN-R6, which is also trained only on the
SUN RGB-D dataset, achieves superior results by taking
advantage of the regularization of semantic segmentation,
and is slightly better than Place-CNN with our RGB-D
input. The results further validate our hypothesis that
the generalization ability of a deep neural network can be
enhanced by involving object-level information.


D. Validation on semantic segmentation and its refinement 


We also evaluate the performance of the semantic
segmentation, the regularizer, and its refined results. The dataset we
use is $S_{45}$ as shown in Table II, which has 37 object classes.
The comparison results are shown in Table IV; the accuracy
is calculated as the mean accuracy of all 37 objects.

In Table IV we first compare the performances of SS-CNN-R6
on $\tilde{S}_{45}$, i.e. the resized dataset. Results show that
depth information significantly improves the mean accuracy
of semantic segmentation. The prediction is then further refined to
increase the mean accuracy. In particular, the accuracies on
“chair”, “ceiling” and “bookshelf” are significantly improved
with refinement.

To make a fair comparison to the benchmark methods
mentioned in [5], our predicted results on the resized dataset $\tilde{S}_{45}$ are directly
resized to the resolution of $S_{45}$, which slightly affects the mean accuracy. The
compared methods are listed as follows:


• Nearest neighbor. A nonparametric method: [5] first
extracts features using the trained Place-CNN to represent
each image, and the test image directly takes the
ground truth of the nearest neighbor in feature space as
its segmentation label.

• SIFT Flow [18]. Also a nonparametric method, which
uses the SIFT flow matching algorithm to search for
matching images in a dataset with available semantic
segmentation.

• Kernel Descriptors (KDES) [19]. A state-of-the-art
method which encodes the input with kernel descriptors,
and the contextual information is considered with
superpixel MRF and segmentation tree.

TABLE V

Class              #Sample   Acc (%)
Computer room      41        19.5
Conference room    29        13.8
Corridor           38        47.4
Kitchen            14        35.7
Office             94        63.8
Rest space         14        57.1
All                230       39.6

As can be seen in Table IV, on dataset $S_{45}$ we also achieve
the state-of-the-art performance on semantic segmentation
with SS-CNN-R6. We illustrate some examples of our
predicted semantic segmentation labels with their refined
results in Figure 8.

E. Experimental validation 

The experiments on the publicly available SUN RGB-D
dataset demonstrate the effectiveness of our SS-CNN.
To further validate the performance of SS-CNN in
robotics-related applications, we conducted an experiment
using our mobile robot. The robot moved around in one of our
university buildings and collected 230 RGB-D images,
belonging to 6 scene classes, with an on-board Kinect V2.

Fig. 8. Illustration of semantic segmentation and its refinement. From left to right: RGB input, ground truth of semantic segmentation, predicted results
of SS-CNN-R6, refined predicted results of SS-CNN-R6. White color in the ground truth images denotes background or confusing regions, which are
considered in neither training nor testing. It can be seen that refinement not only plays the role of smoothing, but also “strengthens” some specific objects in
the corresponding scene.

For scene classification, we use the SS-CNN-R6 trained
on the SUN RGB-D dataset to predict the scene classes of the
collected images, without retraining the network on images
from the new environment. To match the input of SS-CNN-R6,
each collected image is also represented as the concatenation
of an RGB image, a depth image and a normal vector image.
The predicted results are given in Table V, where the mean
accuracy over the 6 classes is given in the bottom row. The
accompanying figure gives some example RGB images with
their predicted labels. Note that some images in this dataset
are challenging even for humans, since the boundary between
some scene classes is not very clear. The last row of that
figure gives two such examples: the ground-truth labels of
these two images are “computer room” and “rest space” respectively,
while they are predicted as “office” and “discussion area”.
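
The input representation mentioned above, a concatenation of RGB, depth and normal vector images, could be assembled roughly as below. This is a sketch under stated assumptions rather than the pipeline used in the experiments: the per-pixel layout (3 + 1 + 3 values), the function names, and the way normals are estimated (image-space central differences that ignore camera intrinsics) are all hypothetical.

```python
import math


def normals_from_depth(depth):
    """Approximate unit surface normals from a depth map (2-D list).

    Uses central differences in image space and ignores camera intrinsics,
    which is a simplification made for this sketch.
    """
    h, w = len(depth), len(depth[0])
    normals = []
    for y in range(h):
        row = []
        for x in range(w):
            # Neighbour indices are clamped at the image border.
            dzdx = (depth[y][min(x + 1, w - 1)] - depth[y][max(x - 1, 0)]) / 2.0
            dzdy = (depth[min(y + 1, h - 1)][x] - depth[max(y - 1, 0)][x]) / 2.0
            nx, ny, nz = -dzdx, -dzdy, 1.0
            norm = math.sqrt(nx * nx + ny * ny + nz * nz)
            row.append((nx / norm, ny / norm, nz / norm))
        normals.append(row)
    return normals


def stack_input(rgb, depth):
    """Concatenate RGB (3), depth (1) and normal (3) values per pixel."""
    normals = normals_from_depth(depth)
    return [
        [tuple(rgb[y][x]) + (depth[y][x],) + normals[y][x]
         for x in range(len(depth[0]))]
        for y in range(len(depth))
    ]


# Toy 2x2 image in front of a flat wall at 2 m: normals point along +z.
rgb = [[(255, 0, 0), (0, 255, 0)],
       [(0, 0, 255), (10, 10, 10)]]
depth = [[2.0, 2.0],
         [2.0, 2.0]]
stacked = stack_input(rgb, depth)
```

In practice the normals would more likely be computed from the back-projected point cloud using the Kinect intrinsics; the image-space version is used here only to keep the example short.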

As can be seen from Table V, the results in this completely
new environment are of a similar order to the validation
results on SUN RGB-D, which further demonstrates
the generalization ability of our SS-CNN. Our SS-CNN
therefore has the potential to be deployed in real robotics
applications without further training.
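
As a quick check of how the bottom row of Table V relates to the per-class rows, the arithmetic can be reproduced as below. The per-class numbers are copied from Table V; the sample-weighted accuracy is not reported in the paper and is derived here only for illustration, assuming each percentage corresponds to a whole number of correctly classified images.

```python
# (class name, number of samples, accuracy in %), copied from Table V.
rows = [
    ("Computer room", 41, 19.5),
    ("Conference room", 29, 13.8),
    ("Corridor", 38, 47.4),
    ("Kitchen", 14, 35.7),
    ("Office", 94, 63.8),
    ("Rest space", 14, 57.1),
]

total_samples = sum(n for _, n, _ in rows)

# Unweighted mean over the 6 classes, as described for the bottom row.
mean_class_acc = sum(acc for _, _, acc in rows) / len(rows)

# Sample-weighted accuracy (derived, not reported): recover the number of
# correct images per class by rounding, then divide by all samples.
correct = sum(round(n * acc / 100) for _, n, acc in rows)
weighted_acc = 100 * correct / total_samples
```

The unweighted mean comes out at about 39.55, matching the reported 39.6 up to rounding of the per-class entries. The derived sample-weighted figure is higher, at roughly 44.8, because the largest class, “office”, is also the best recognized one.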

VI. Conclusion 

In this paper, we address the scene classification problem
using deep learning methods with a much smaller number
of training images, by regularizing the deep architecture with
semantic segmentation. Experimental results validate the
effectiveness of the regularization, as SS-CNN achieves
state-of-the-art results on both scene classification and
semantic segmentation on the publicly available SUN RGB-D
dataset. Further experiments on our robot demonstrate
the generalization ability of the proposed approach. In
future work, we would like to explore both the horizontal
and the vertical dimension, that is, to couple more relevant
tasks and to find a better architecture for incorporating
the relations between these tasks.

References 

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification 
with deep convolutional neural networks,” in Advances in neural 
information processing systems, pp. 1097-1105, 2012. 

[2] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and 
T. Darrell, “Decaf: A deep convolutional activation feature for generic 
visual recognition,” arXiv preprint arXiv:1310.1531, 2013. 

[3] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn
features off-the-shelf: an astounding baseline for recognition,” in
Computer Vision and Pattern Recognition Workshops (CVPRW), 2014
IEEE Conference on, pp. 512-519, IEEE, 2014.

[4] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning 
deep features for scene recognition using places database,” in Advances 
in Neural Information Processing Systems, pp. 487-495, 2014. 

[5] S. Song, S. P. Lichtenberg, and J. Xiao, “SUN RGB-D: A RGB-D
Scene Understanding Benchmark Suite,” CVPR, 2015.


[6] J. Yao, S. Fidler, and R. Urtasun, “Describing the scene as a whole: 
Joint object detection, scene classification and semantic segmentation,” 
in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE 
Conference on, pp. 702-709, IEEE, 2012. 

[7] D. Lin, S. Fidler, and R. Urtasun, “Holistic scene understanding for 
3d object detection with rgbd cameras,” in Computer Vision (ICCV), 
2013 IEEE International Conference on, pp. 1417-1424, IEEE, 2013. 

[8] R. Luo, S. Piao, and H. Min, “Simultaneous place and object
recognition with mobile robot using pose encoded contextual information,” in
Robotics and Automation (ICRA), 2011 IEEE International Conference
on, pp. 2792-2797, IEEE, 2011.

[9] J. G. Rogers III, H. Christensen, et al., “A conditional random field
model for place and object classification,” in Robotics and Automation
(ICRA), 2012 IEEE International Conference on, pp. 1766-1772,
IEEE, 2012.

[10] J. Lafferty, A. McCallum, and F. C. Pereira, “Conditional random 
fields: Probabilistic models for segmenting and labeling sequence 
data,” 2001. 

[11] L.-J. Li, H. Su, L. Fei-Fei, and E. P. Xing, “Object bank: A high- 
level image representation for scene classification & semantic feature 
sparsification,” in Advances in neural information processing systems, 
pp. 1378-1386, 2010. 

[12] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li,
S. Yang, Z. Wang, C.-C. Loy, et al., “Deepid-net: Deformable deep
convolutional neural networks for object detection,” in Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 2403-2412, 2015.

[13] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks 
for semantic segmentation,” arXiv preprint arXiv:1411.4038, 2014. 

[14] C. Couprie, C. Farabet, L. Najman, and Y. LeCun, “Indoor 
semantic segmentation using depth information,” arXiv preprint 
arXiv:1301.3572, 2013. 

[15] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik, “Learning rich 
features from rgb-d images for object detection and segmentation,” 
in Computer Vision-ECCV 2014, pp. 345-360, Springer, 2014. 

[16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, 
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for 
fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014. 

[17] A. Oliva and A. Torralba, “Modeling the shape of the scene: A 
holistic representation of the spatial envelope,” International journal 
of computer vision, vol. 42, no. 3, pp. 145-175, 2001. 

[18] C. Liu, J. Yuen, and A. Torralba, “Nonparametric scene parsing:
Label transfer via dense scene alignment,” in Computer Vision and
Pattern Recognition (CVPR), 2009 IEEE Conference on,
pp. 1972-1979, IEEE, 2009.

[19] X. Ren, L. Bo, and D. Fox, “Rgb-(d) scene labeling: Features and 
algorithms,” in Computer Vision and Pattern Recognition (CVPR), 
2012 IEEE Conference on, pp. 2759-2766, IEEE, 2012. 




